Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Abstract

Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set" -- probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via $k$-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score $s(x)$. We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks -- VSI-Bench, CV-Bench, MMMU, and VideoMME -- we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
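
The two components can be illustrated with a minimal sketch. It is not the paper's implementation: the fine-tuned LLM and the Random Forest are replaced by a much simpler vision-blind predictor that only counts answer frequencies per question template. The function names, the `(template, answer)` sample format, the `threshold`, and the `batch` size are all assumptions introduced for illustration. What the sketch preserves is the structure described above: each sample is scored by a model trained on the other folds of the test set, and high-scoring samples are pruned in rounds with scores recomputed after each round.

```python
import random
from collections import Counter, defaultdict


def kfold_bias_scores(samples, k=5, seed=0):
    """Out-of-fold bias score s(x) for each (template, answer) sample.

    Assumption: s(x) is the held-out frequency of the correct answer
    among training-fold samples sharing the same question template.
    No visual input is used anywhere.
    """
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    scores = [0.0] * len(samples)
    for held_out in folds:
        held = set(held_out)
        # "Train" on the other folds: answer counts per template.
        counts = defaultdict(Counter)
        for i in idx:
            if i not in held:
                template, answer = samples[i]
                counts[template][answer] += 1
        # Score the held-out fold with the blind predictor.
        for i in held_out:
            template, answer = samples[i]
            total = sum(counts[template].values())
            scores[i] = counts[template][answer] / total if total else 0.0
    return scores


def iterative_bias_pruning(samples, k=5, threshold=0.5, batch=5, max_rounds=50):
    """Drop up to `batch` highest-bias samples per round, then rescore.

    Stops when no remaining sample has s(x) above `threshold`
    (a hypothetical stopping rule chosen for this sketch).
    """
    kept = list(samples)
    for _ in range(max_rounds):
        if len(kept) < k:
            break
        scores = kfold_bias_scores(kept, k=k)
        flagged = [i for i, s in enumerate(scores) if s > threshold]
        if not flagged:
            break
        flagged.sort(key=lambda i: scores[i], reverse=True)
        drop = set(flagged[:batch])
        kept = [x for i, x in enumerate(kept) if i not in drop]
    return kept
```

On a synthetic toy set in which one template always has the same answer and another has four equally frequent answers, the first group receives high scores and is pruned, while the second stays below the threshold. Rescoring after every round matters because removing samples changes the statistics that the blind predictor can exploit.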
